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Abstract 


Consider observations yi,... ,yn on nodes of a connected graph, where the y* independently come 
from N{9i,a‘^) distributions and an unknown partition divides the n observations into blocks. One 
well-studied class of change point problems assumes the means 9i are equal for all nodes within 
contiguous blocks of a simple graph of sequential observations; both frequentist and Bayesian 
approaches have been used to estimate the 9i and the change points of the underlying partition. 
This paper examines a broad class of change point problems on general connected graphs in which a 
regression model is assumed to apply within each block of the partition of the graph. This general 
class also supports multivariate change point problems. We use Bayesian methods to estimate 
change points or block boundaries of the underlying partition. This paper presents the methodology 
for the general class of change point problems and develops new algorithms for implementation via 
Markov Chain Monte Carlo. The paper concludes with simulations and real data examples to 
demonstrate application of the methodology on a wide range of problems. 

Keywords: change points, Bayesian methods, structural change, product partition models 
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1. INTRODUCTION 


The classical change point problem considers sequential observations yi,... ,yn, assumed to come 
from distributions. It assumes that an unknown partition divides the n observations 

into blocks. The goal is to recover the 9i and, perhaps, to infer the location of change points 

;his classical change point problem and its 


close relatives; see, for example. 

Chen and Gupta 

2012). The original Bayesian approach to this 

problem was developed in Barrv and Hartigan 

1991 

1), which models the unobserved partition using 

a product partition distribution 

Hartigan 

199( 

])• 


This article proposes new Bayesian methodology, also using product partition distributions, 
that tackles a more general change point problem. We consider observations residing 

at n nodes of a connected graph and assume an unknown underlying partition divides the nodes 
into blocks sharing the same distribution parameters. Two nodes joined by an edge are more 
likely to be in the same block than are non-neighboring nodes, although nodes of a block need 
not be contiguous. We consider a linear model regressing y on x within each block. This general 
methodology, fully specified in Section [21 is sufficiently flexible to support many different forms of 
change point analysis. 

Special cases encompassed by this generalized change point problem have been studied pre- 
viously. In most inst a nces, as in the classical change point problem studied by, for example. 


Erdman and Emerson! (j2007l . 


2003 ), observations are assumed to be sequential; this implies a 


path graph as the underlying structure. The observ ations can also 


common ch ange point structure across all dimensions ([Perreault et al 


2006 


3e multivariate, sharing a 


2 OOOI : 


2009lf . Some app r oache s additionally allow for changes in variance ([James and Matteson 


2006 


Zamba and Hawkins 


2014 


Killick et al. 2012 

. 2014). Popular areas of application indue 

e the environmental sciences 

(Perreault et al. 

2000 

; 

Jhen and Gupta 

2012 

) and finance ( 

Holbert 

1982; 

Lavielle and Tevssiere 


James and Matteson 


els within b’ 


Seidou et al. 


20071 : 


2014|j. Another variation uses serial o 


ocks (Zeileis et al 


Lose 


ri et al 


3servations to fit linear mod- 


2002; 


Bai and Perron 


2003 


Muggeo 


20031 . 


200a 


Eearnhead 


2 OIOIL App lications are found in econ ometrics (iLoschi et al.l 


envir onmental sciences ([Qian and Rvull2006l f , and biomedical sciences (iMuggeo 


2010|). And motivated by the application of image restoration, 


2003; 




Barry and HartiganI ([l994l i presented 


2003; 


2OIOIL 


Muggeo and Adelfio 
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a Bayesian methodology for conducting classical change point analysis on a grid graph. We will 
show that many of these change point problems are special cases of our general form of the problem. 

We present the theoretical construction and implementation of our methodology in Section [21 
selected relevant derivations appear in Appendix A. We show that our methodology encompasses 
(but is not limited to) several important constant-variance change point problems, including multi¬ 
variate change point analysis for sequential data, univariate change point analysis for data on a grid 
graph, and linear regression change point analysis for data on a general graph. Section [3| provides 
simulation results comparing available methodologies on grid graph problems. Most significantly, 
the new BCP-Graph-0 algorithm demonstrates robustness with respect to the often difficult choice 
of an important tuning parameter. Section [3| also applies the methodology to real data problems. 
Section m offers a brief discussion. Our methodology is newly implemented in a major revision of R 
package bcp (version 4.0.0). 


2. METHODOLOGY 


2.1 Model 

We consider a connected graph with n nodes. An observation {xi,yi) is recorded at the Lth node, 
where Xi = (xn,..., Xik) is a vector of values for k predicting variables. Given a partition p 
dividing all nodes into b blocks, each block S is associated with a vector ys and a matrix Xs', ys 
contains all observed yi for nodes i in block S and Xs contains the corresponding Xi as rows. For 
a block containing ns observations, denote Xs to be the ns x (k + 1 ) matrix with a column of 
ones prepended to {Ins ~ l/^S''Ais)Y 5 , a transformed design matrix with each predicting variable 
centered about its blockwise mean. Ip is the p x p identity matrix and Jp is the p x p matrix of 
ones. We assume ys ~ Nns{IIsls,'^‘^Ins) with 75 = {as,/3si, ■ ■ ■ a-iid unknown variance 

The prior for intercept as is 


as ~ N 


«0) 



( 1 ) 


V ns/ 

where ao ~ U{—oo, 00). Without any constraints on the size of a block in a partition, it is possible 
for a block to have too few observations to fit the model coefficients. Rather than assigning zero 
mass in the prior for all snch partitions p having at least one small block, our model fits only an 
intercept in small blocks (having 2k or fewer observations). For other blocks, the prior is a mixture 
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of the intercept-only model and the full regression model. A parameter ts accomplishes this via 
the following prior: 

P{ts = 0) = — - —:l{n 5 > 2k} + l{ns' < 2k} and 
ns + d 

P{ts = 1) = l{ns > 2k}. 

ns + d 

If Ts = 0, then our model assumes only an intercept within block S and, consequently, /3sj = 0 for 

all j; if Ts' = 1, then we use the full linear model for block S and adopt the following prior on the 

regression parameters: Psj ~ dV (o, aj/Vsjj^ , where Vsjj is the {j,j)-th element of Vs = XjXs- 

We use a different cr| for each predictor to account for changes in variance across different predictors. 

Instead of imposing priors on each (t| directly, we consider priors on the error-to-signal ratios 
2 

~ ~ ^J = 0) • • •) k. The w'- are hyperparameters requiring user input; w'- = 0.2 

for all j works well in most settings. In addition, we use an improper prior on the error variance: 
'7r((T^) oc l/u^, where € (0, oo). Finally, we use the following prior on the partition p\ 

Tr{p) (X (2) 


where a < 1 is a pre-specified parameter and l{p) is the total boundary length of the partition. 


calculated as Ylj=i with l{Sj) as the number of nodes that are neighbors of Sj. A neighbor of 

block Sj is not in Sj but shar e s at le ast one edge with a node in Sj. This prior is similar to the prior 


given m 


Barry and HartiganI (|l994l l for classical change point analysis on a grid graph and shares 


the same “short boundary” property that encourages the number of adjacent nodes belonging to 
different blocks to be small. Intuitively, a represents the multiplicative penalty in the likelihood 
for each additional boundary edge, a < 1 implies a preference for shorter boundaries over longer 
ones; the smaller the a, the heavier the penalty for each boundary edge. 


2.2 Conditional Distributions and Expectations 

We begin with some notation before presenting the relevant formulas used in implementing our 
methodology. Given a partition, let W = ~ “ vY be the 

within-block and between-block sums of squares, respectively, where ys is the mean of yi within 
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block S and y is the overall mean of all y*. Define 

/ i-^’o 0 


Zs = 


nsWQ 

0 


1—wi 

Vsii^i 


0 

0 


VO 0 ... 0 

For a matrix Q, denote (5(-i,-i) as the submatrix formed by removing the first row and first 
column from Q, and denote Q-i as the submatrix formed by removing the first row from Q. 
Let 0s be the posterior mean of the slope coefficient(s) in block S given w and ts = 1, where 
0s = [{XjXs + Zg0-^Xjys] Finally, let fF = ID - E 5 :rs=i Psi^S^S + ^ 5 ')(-i,-i)/3s. 

This model leads to a number of conditional distributions (derived in the Appendix) used in 
the implementation. For B > 0, 


f{p, r\y, X, w) (xf(Tlp)f(p)f(ylx, p, w, r 

d 


oc 


n 


s ■- 


X a 


+ 


Kp) 


ns + d 
ns 


ns + d 


> 2A:} + llns < 2k} ) l{r 5 = 0} 
y 

l{ns > 2 /c}^ 1{ts = 1 } 


Wi 


/(fe-l)/2 
0 


nS:rs = l\(^sXsZs+ /)(-!,- 1 ) 


- 1/2 


X Beta 


B(b+l)/2wk-b-2)/2 
BwqIW 6 + 1 n — b — 2 


1 + Bw'q/W 
f{w\y,x,p,T) oc/(y|ai,/5,m,r)/(m) 


’ 2 ’ 


and 


(3) 


w, 


l{b-l)/2 
0 


oc- 


= l + /)(-!,-1) 


- 1/2 


X Beta 


5('>+l)/2p[y(n-fe-2)/2 
Bw'qIW 6 + 1 n — b — 2 


(4) 


^i + Bwyw 2 2 y 

If instead B = 0 (which certainly occurs when 6 = 1 but might even occur with real data in 
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non-trivial cases), then 


fiP,'T\y,x,w) ocJJ 


s 


d 


X a 


+ 

Kp) 


ns + d 
ns 


ns + d 


> 2 /c} + l{ns < 2k} ) l{r 5 = 0 } 
l{ns > 2 A:}) 1{ts = 1 } 


w: 


X 


{XjXsZs +1)^-1,-i) 


- 1/2 


f{w\y,x,p,T) cx- 


Wn 


(XJX5ZS + /)(_!,_1) 


and 


-1/2 




( 5 ) 

( 6 ) 


Define Wq = E{wo\y,x, p,t). The conditional expectations, given below, are used for estimating 
the posterior means: 


E{as\y, X, p, w, r) = (1 - -^ 0)^5 + WQy, 

( 7 ) 

E{f3s\y,x,p,w,T) = Ps, and 

(8) 

E{a^\y,x,p,w,T) = ° . 

n — 6 

(9) 


For i? = 0, it is easy to see that Wq = Wq/2 and for > 0 

~ Beta ( 

W 


Wn = 


Bw'qIW^ . 6+3 n-b-4 
\l+Bw'g/W' 2 ’ 2 


B 


Beta 


( 


Bw'qIW^ ■ k±l ri-b-2 
Kl+Bui'^/W' 2 ’ 2 


2.3 Implementation via Markov Chain Monte Carlo 

Due to the complexity of the posterior distributions, we use Markov Chain Monte Carlo (MCMC) 
to sample the partit i on p and r. Specifically, we use a combination of pixel passes adapted from 


Barry and HartiganI (| 1994 1 and two new passes. We describe each of these below. 

The pixel passes make use of ([3]) and dS} for sampling. An All (Full) Pixel Pass (FPP) iterates 
over each node in turn, allowing the current node to join any one of the other 6—1 blocks, stay in 
its current block, or form a new block while simultaneously updating +5 for the affected block(s). 
FPPs are computationally intensive, requiring approximately 0{nb) likelihood calculations at each 
pass (where 6 may change during the pass). 





















Active Pixel Passes (APPs) take advantage of the fact that nodes interior to a block are unlikely 
to switch memberships because of the heavy likelihood penalty that would be incurred for new 
boundary edges. By construction, APPs skip nodes that are in the interior of a block and only 
consider membership updates for nodes that are active, residing at the boundary of a block. If 
node i is active, then one of two things happen: 1) if node i agrees with at least one other neighbor, 
then the node may join a neighboring block and 2) if, instead, node i is an island (disagreeing with 
all of its neighbors), then it is only allowed to join blocks that preserve its island status. Again, 
as we determine the new block membership for node i, we simultaneously consider updating r for 
the affected block(s). The preservation of island nodes is a key r equirement in the conditioned 


transition argument that justifies APPs (|Barrv and Hartigan 


19941 ) while simultaneously being at 


odds with the short boundary philosophy. Our simulations in Section [3. 3l suggest that the treatment 
of island nodes in the APP causes the number of blocks h to be extremely sensitive to a; when a 
is slightly larger than the acceptable range, the sampled partitions have too many island nodes, 
increasing the runtime and resulting in noisy posterior means. 

We propose a pseudo-APP that modifies the APP only in its treatment of island nodes. A 
pseudo-APP treats all active nodes, island or otherwise, the same way, allowing the node to join 
any of its neighboring blocks or stay in its current block. Simulations suggest that using the pseudo- 
APP in place of the original APP increases robustness to a, widening the acceptable range of a 
values significantly (Section l3.3p . Unfortunately, the modified pseudo-APP breaks the conditioned 
transition argument and is no longer a legal MCMC move. Although the resulting sample is not 
guaranteed to come from the posterior distribution /(p, rjy), our simulations show that for short¬ 
boundary scenes, pseudo-APPs produce estimates that are more in line with the ground truth 
than are estimates produced by original APPs. To make a fair comparison, we will also illustrate 
some scenarios where the pseudo-APPs fare worse. Our software implementation allows the user 
to specify the proportion of APPs that are pseudo-APPs. 

In addition to the pixel passes, we also use a block merge pass and a w-pass. The block merge 
pass iterates over each block Sj for j = 1,... ,b. For block Sj, considering merging Sj with another 
block S in the current partition satisfying ts = ts^ . The w-pass allows for iteratively updating wi 
foT I = 1,..., k using (j3]) and ([6]). 
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The algorithm begins with 100 FPPs that are discarded as part of the burn-in. We then proceed 
with M additional steps, each consisting of 1 FPP, 20 APPs, 1 block merge pass, and 1 rc-pass; 
some of these steps may be additionally discarded as part of the bnrn-in. After discarding a number 
of iterations for the burn-in, each pass through the data ends in a calculation of the conditional 
expectations given by ([7]), ([8]), and Q. Given a partition p at the t-th MCMC step, we calculate a 
posterior mean yu for observation i in block S. Finally, we obtain our estimates for the posterior 
means {y*} and its associated variances by aggregating over the conditional expectations calculated 
in each pass. 

Note that when all Xij are equal within a block S for a predictor j, a singularity in XgXs 
renders certain calculations in the algorithm impossible. This can happen, for example, if variable 
j is discrete-valued. In such situations, our algorithm temporarily adds a small amount of noise 
to the data for the calculation in which the singularity is encountered. The noise is independently 
regenerated for subsequent calculations when necessary. 


2.4 Special Cases 

Classical change point analysis uses sequential observations, implying a path graph with a partition 
of consecutive blocks. In this setting, blocks differ in the mean parameter of the underlying normal 
distributions. Only fitting a mean within each block is equivalent to fitting trivial intercept-only 
linear models within each block. Therefore, classical change point is a special case of the change 
point problem on a graph with linear regression. 

When the under l ying g raph structure is a path graph, we recommend borrowing elements from 


Barry and HartiganI ( 1993l l. replacing the partition prior ([2]) with vr(/?) = fQ° ^(1 — p)^ ^dp for 


some prespecified pq. This prior reflects the assumption that each node is equally likely to be 
a boundary node with equal probability p. The MCMC algorithm simplifies greatly with a path 
graph; rather than making MCMC moves that alter the block membership node by node, we instead 
consider breaking or merging blocks at each possible change point location. 

In multivariate change point analysis, we fit multivariate normal distributions, thereby esti¬ 
mating a vector of means within each block. We can accomplish this in the general change point 
framework by realizing that fitting a vector of means is equivalent to performing a one-way anal¬ 
ysis of variance. To conduct fc-dimensional change point analysis for n observations, we view our 
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observed data at node i as yi = (ya, • • •, Uik)^ and Xi = Ik- Since all of the coefficients ai ,... ,ak 
in this model are means, we directly use the design matrix Xs without centering and adopt the 
intercept prior ([T]) for all aj. Moreover, because there are no coefficients to estimate, we effectively 
set rg = 0 for all blocks S in the partition, and there is no need to sample r. In the same vein, our 
method can be generalized to include multiple observations at each node, even in the regression 
setting. 


3. SIMULATIONS AND APPLICATIONS 
3.1 Multivariate versus Univariate Change Point Analysis 

Multivariate change point analysis is an easier problem than univariate change point analysis; 
added dimensions yield more information on change point locations because the change points are 
replicated over each dimension. If the noise level is high, a univariate signal may get lost in the noise 
far more easily than a multivariate one. Figure [T] shows the results of our method on a univariate 
example and a multivariate example sharing the same change point structure. The dataset on 
the left is one-dimensional and the dataset on the right is five-dimensional. Both datasets in the 
example have the same underlying change point at i = 50; the means are 0 and 1 in the hrst 
and second blocks across each dimension. The algorithm struggles to decisively identify the single 
change point using the univariate series. In contrast, the multivariate change point results show a 
high estimate for the posterior probability of a change point at the true location. 

[Figure 1 about here.] 


3.2 Quebec Streamflow 


Perreault et al 


(|2000ll applied their multivariate single change point detection methodology to 
annual January to June streamflow (measured in L/{km? -s)) for six rivers in Quebec, Canada. We 


riili 


examine data from four of the six riverqj in their study; the other two rivers appear in the database 
but th eir recorded characteristics differ substantially from the descriptions given in 


3 


Perreault et al 


(I 2 OOOI I and are not included in our example. Using their algorithm, the authors found the year 


1984 to be the mode of the marginal posterior distribution of the change point locations, and 


^http://www.wsc.ec.gc.ca/applications/H20/index-eng.cfm 
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gave estimates for the means before and after the change point. Our results for the four rivers 
(Romaine, A la Baleine, Churchill Falls, and Manicouagan) are given in Figure [21 Our method 
confirms the year 1984 as a likely change point, but additionally discovers possible change points 
near the beginning of the time period. 


[Figure 2 about here.] 


3.3 Simulations Involving Change Points on a Grid 

We co nsidered 20 simulated scenes on a 20-by-20 grid graph to compare the method in 


Barry and Hartigan 


(jl994li (BCP-Grid) against our method BCP-Graph; our method assumes up to eight neighbors per 


node instead of BCP-Grid’s four, by additionally including diagonal edges between nodes. These 
20 scenes (Figure [3]) include a range of characteristics representative of real datasets that are ap¬ 
propriate for both methods. For each scene, we simulated 10 datasets and ran each method using 
a range of values for the parameter a. We used M = 2000 steps with the first 1000 steps discarded 
as part of the burn-in. 


[Figure 3 about here.] 

We plot the mean squared error (MSE) as the main measure of performance in Figure 01 BCP- 
Graph-1 corresponds to our method using Barry and Hartigan’s APPs; BCP-Graph-0 corresponds 
to our method using pseudo-APPs. While comparable levels of MSE are achievable across all 
three methods for most scenes, BGP-Grid and BCP-Graph-1 both require careful selection of a. 
BCP-Graph-1 has a wider range of acceptable values for a relative to BCP-Grid. Overall, scenes 
with shorter boundaries (Scenes 1-11) favor lower values of a. As a becomes larger, both of these 
algorithms iterate over partitions with a larger number of blocks, increasing the MSE. Of the three 
methods, BGP-Graph-0 is the most robust to the choice of a for these shorter boundary scenes, 
because pseudo-APPs eliminate small isolated blocks in the partition. However, BGP-Graph-0 
performs poorly in longer boundary scenes, which have a large number of isolated islands in the 
underlying true partition (Scenes 15 through 19). 

[Figure 4 about here.] 
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Plots comparing runtime and mean number of blocks for these simulations can be found in 
Appendix B. Independently of the choice of a, BCP-Graph-0 is far more computationally efficient 
than either BCP-Grid or BGP-Graph-1 and also encourages partitions having smaller numbers of 
blocks. 


3.4 New Haven Real Estate Values 

We conclude by applying our method to an example modeling housing values. Housing values 
could vary with observed characteristics such as the number of bathrooms, number of bedrooms, 
and living area, or with unobserved characteristics such as neighborhood structure. A single linear 
model applied to all properties may not be appropriate; different associations between the variables 
might exist across different neighborhoods. To more accurately model housing values over many 
neighborhoods via linear regression, we would need a priori neighborhood boundaries; these may 
not be available or inaccurately estimated. Our method, however, handles the situation gracefully 
because it considers partitions (neighborhoods) that are not pre-specified and allows slopes and 
intercepts to vary from one neighborhood to another. 

Our example uses New Haven, Gonnecticut residential property dati^, and latitudes and longi¬ 
tudes obtained with the Google Maps API. We have limited the data to all 244 houses (excluding 
condominiums and apartments) within the region outlined by the dashed lines in Figure [5l We 
chose this region because it consists of roughly three value-differentiating neighborhoods that are 
separated with respect to longitude. We model the 2011 log assessed value as a function of the 
square root of living area, lot size, and number of bedrooms. Although location is not directly 
used in the linear models, the graph structure and exploration of partitions of the properties into 
neighborhoods introduces a spatial component to the analysis. 

To use the location of each house in the analysis, we used Euclidean distance on the lo ngitudes 


and latitudes in generating a minimum spanning tree (|Pri: 


riiJ 


1957 


Oksanen et al. 


201 a) over all 


houses. We then ran our method with a = 0.1 and a = 0.3 using pseudo-APPs. The left panel of 
Figure [5] shows the actual properties with circles, where the size of a circle indicates the magnitude 
of the actual assessed value. The right panel shows the posterior modal partition (for both a = 0.1 
and a = 0.3) consisting of three neighborhoods. Upon visual inspection, the large neighborhood 2 


^http://data, visionappraisal.com/newhavenct/ 
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off to the west consists of houses that are tightly clustered together and are generally lower-valued 
than properties in the other neighborhoods. Neighborhood 1 to the east consists of more expensive 
houses that are more spread apart (having larger lots). Neighborhood 3 separates the other two 
neighborhoods and seem to be a mix of cheaper and more expensive houses. Both a = 0.1 and 
a = 0.3 produced the same modal partition, but a = 0.1 resulted in an average partition size of 
3.01 neighborhoods and a = 0.3 resulted in an average partition size of 4.46 neighborhoods. 

[Figure 5 about here.] 

For convenience, our subsequent discussions focus on the a = 0.1 results. In Figure E] we 
compare the residuals from three different models: (a) a single linear model without any spatial 
component, (c) our method, which allows for different coefficients within different neighborhoods, 
and (b) an intermediate model sharing characteristics of the previous two, additionally using the 
modal partition from our method as a categorical variable in the linear model (a). All models use 
the same predicting variables with the same transformations, but (b) and (c) also indirectly use 
the adjacency structure which represents the spatial locations of the houses. The residuals from 
the original linear model (a) vary greatly by longitude, suggesting that some spatial effect has not 
been incorporated. The revised linear model (b) accounts for some of the spatial effect, and has a 
much smaller spread in the residuals. Our method produces the smallest standard deviation in the 
residuals among the three approaches, and the associated plot for (c) shows the residuals to be far 
more homogeneous across longitudes. 

[Figure 6 about here.] 

One might argue that a minimum spanning tree over all houses may not be the most intuitive 
graph structure for modeling the data. We concede there are alternative approaches. A fe-nearest 
neighbors approach would require defining k beforehand and even for small values of k, such as 
A: = 4, would result in joining some houses that are very far apart. Another possible approach is to 
create the graph structure using street names and numbers to join adjacent houses; however, the 
closest house may lie around the corner on a different street. We chose to use a minimum spanning 
tree because among these choices, it is the most objective way of generating the graph. 
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4. DISCUSSION 


This article proposes a Bayesian methodology that addresses a general class of change point prob¬ 
lems. We demonstrated its use and performance on multivariate change point analysis for sequential 
data, univariate change point analysis for data on a grid graph, and linear regression change point 
analysis for data on a general graph. Our approach performs well on real data examples - Que¬ 
bec streamflow data and New Haven real estate data. For the Quebec streamflow pro blem, our 


Bayesian methodology confirmed the change point discovered by 


Perreault et al. 


(l200di and pro¬ 


vided additional information about other possible change points. For the New Haven real estate 
problem, our method identified a modal partition that was consistent with our a priori knowledge 
of the neighborhoods in that area of New Haven. 


We compared the performance of our method to BCP-Grid (|Barrv and Hartiean 


199J) using a 


diverse set of 20 simulated scenes on a grid graph. Our BCP-Graph algorithm with pseudo-APPs 
should be used with caution when many island nodes are suspected in the data. However, we 
showed that the BGP-Graph algorithms are more robust than BCP-Grid to the choice of the tun¬ 
ing parameter a, especially when using pseudo-A PPs; the original BGP-Gri d algorithm exhibited 


extreme sensitivity to the choice of a, as noted in 


Barry and HartiganI (Il994l b and is therefore not 


recommended. In general, the use of pseudo-APPs helps improve performance by preventing the 
formation of many island nodes. As a result, the use of pseudo-APPs increases the useable range 
of a when working on real data problems where many island nodes are not expected. 
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APPENDIX A. SELECTED DERIVATIONS 


he w ithin-block density, shown below, is dependent upon ts- If ts = 0, then 


Barry and Hartigan 


(jl993l l gives 


f{ys\xs, ao, rg = 0 ) oc ^ 




*^5/2 r nswoiys - aof+ J2i&siyi~ ysf 


exp 


2ct2 


If instead ts = 1, then we fit a linear model within block S: 


f{ys\xs,ls,(^‘^,Ts = 1 ) 


f 1 ' 

.ns/2 

f 1 r 


j exp < 

2 cj 2 

f 1 ' 

V ”s /2 

f 1 r 


j exp < 

2 cj 2 




I \ »^s /2 
-2 


a 


1 


{ys - ^sis) {ys - ^sis) + (7s - 7s) Xg Xs{^s - 7s) 


(A.l) 


where 75 is the ordinary least squares (OLS) estimate of the coefficient vector 75 . We write 
7 s = («S)/3s) where as = ys and (3s are the OLS estimates of the intercept and non-intercept 


components; we distinguish the OLS estimate Ps = 


{X^Xs)-^X^ys 


J -1 


from the Bayes estimate 


out 75 in terms of its components: 


Ps = (-^ 5 -^S 3- Zg^) ^Xjys ■ We consider the last term in the exponent in (lA.ip and expand 


(7s - 7s)'^^s ^s(7s - 7s) 

= ns{as - as)^ + (Ps - Ps)~’'^JZs(_i_i-)(Ps - Ps) 
= ns{as - ysf' + {Ps - PsVX'sXs(^_i_i'^{Ps - Ps)- 

Since as ~ V(q;oWo/^s)) we average (|A.ip over as to get 


f{ys\xs,oio,Wo,Ps,cr‘^,rs = 1 ) 


I \ns/2 

oc ( exp 


1 


nswoiys - ao)^ + '^{yi - ys)‘^ + {Ps - /3s)'^-^J-’^S(_i _i)(/3s - Ps) 


i£S 
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We rewrite the prior on (3$ as A^fc(O) o'^-^S(-i,-i)) and average over (3$ to arrive at 


f{ys\xs,ao,a'^,w,Ts = 1) 


1 _ l_i/2 


I + /)(_!,_!) I 

nswoivs - aof + J^ieS^jiVij “ Vsf - Psi^s^s + Z^^)(_i,_i)^s 


exp 


The joint density of all observations is then 


2c72 


f{y\x,ao,cr‘^,w,T, p) 

n/2 


OC 


n \ix]xszs+i)^_,^ 


S:ts = 1 


- 1 ,- 1 ) 
2 


- 1/2 


X exp 


W + woY^nsiys - ctor - ^S:ts=i (^siX]Xs + Zg )(_i _i)^5 

2^2 


Averaging over oq, a'^, and finally wq, we get for 6 > 1, 


f{y\x,p,w,T) OC- 


w, 


/(6-l)/2 ■ 
0 


ns:., = l (AJAsZ5 + /)(-!,-1) 


- 1/2 


5(^>+l)/2V^(n-b-2)/2 

^ f BwUW 6 + 1 n — b — 2 
\l + BwyW 2 2 


Again, we consider the two cases: B = 0 versus B > 0. For S > 0, 


f{p, r\y, X, w) (xf{T\p)f{p)f{y\x, p, w, r) 
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ocll 

5 L 


X Oi 
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Kp) 


ns + d 
ns 


ns + d 


1{?T-S > 2A:} + tins' < 2A:} ) l{rs = 0} 
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l{ns > 2/c}^ l{rs = 1} 


n5:r,=l (XjXsZs + /)(_!,_i) 


- 1/2 


X Beta 


B{b+l)/2win-b-2)/2 
Bw'qIW 6 + 1 n — 6 — 2 


y + Bwyw 2 2 y 

It is not possible to average over w to get a closed form expression for f{p,T\y,x); we can get a 
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closed form expression for f{w\y,x,p,T), which for B > 0 is 


If instead B 


f{w\y,x,p,T) (x.f{y\x,p,w,T)f{w) 


w. 


!(b-l)/2 ■ 
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oc- 




- 1/2 


X Beta 


B{b+l)/2win-b-2)/2 
BwqIW 6 + 1 n — b — 2 


I + Bw'q/W 


’ 2 ’ 


= 0, then 

f{p,r\y,x,w) ocJJ 
S L 


d 


ns + d 


> 2A:} + dins' < 2A:} ) l{rs = 0} 


+ 


ns 


ns + d 


l{ns > 2k} l{rs = 1} 


X Oi 


W'n 


dp) 


(XjXsZs + d)(-i,-i) 


- 1/2 


I++-l)/2 


and 


f{w\y,x,p,T) oc- 


Wn 


(XjXsZs+ /)(_!,_i) 


- 1/2 


I++-l)/2 

APPENDIX B. SUPPLEMENTARY MATERIALS 
[Figure 7 about here.] 


[Figure 8 about here.] 
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List of Figures 

Two simulated examples: (a) univariate data yi ~ N{9i, 1), where Oi = 0 for i < 50 
and = 1 for i > 50; (b) multivariate data Pi ~ N^{6i, I), where 6i = (0, 0,0, 0,0)"'' 
for i < 50 and Oi = (1,1,1,1,1)^ for i > 50. The blockwise means within each 
dimension were assumed equal for this particular example, but in general, the block- 
wise means are unconstrained. The top panels show the simulated data and the 
posterior means at each location in the series, and the bottom panels show the pos¬ 
terior probability of a change point. 

Quebec streamflow example. The year 1984 has a high posterior probability of being 

a change point. 

Scene partition boundaries and block means for simulations on 20-by-20 grid graphs. 
An error variance of ci^ = 1 is used throughout. The true number of blocks b is given 

in parentheses. 

Mean squared error (MSE) for all simulated grid scenes over a large range of values 
for the parameter a. Generally, small values of a produce lower MSE. However, the 
contrived large-boundary scenes marked with asterisks attain lower MSEs at higher 

values of a. 

New Haven housing example: minimum spanning tree joining all houses (in circles, 
with size indicating the magnitude of the true assessed value, logval) with dashed 
lines demarcating the region of interest, colored by neighborhood membership in 
the posterior modal partition. Neighborhood 3 houses he on busy Prospect Street; 
upscale neighborhood 1 houses lie on quiet Edgehill Road or Huntington Street; 
neighborhood 2 is a dense neighborhood of generally smaller houses on smaller parcels 

of land. 

Residuals by longitude for the New Haven housing example: (a) residuals from a 
linear model {SE = 0.200) without any spatial or change point structure; (b) resid¬ 
uals from a linear model that estimates a different intercept for each neighborhood 
as identified by the posterior modal partition presented in Figure [5] {SE = 0.100); 
(c) residuals from our method {SE = 0.092). Residuals are colored for the three 

neighborhoods in the modal partition. 

Mean runtime (seconds) for all simulated grid scenes. Scenes marked with asterisks 

generally attain lower MSEs at higher values of a. 

Mean number of blocks per iteration for all simulated grid scenes. Scenes marked 
with asterisks generally attain lower MSEs at higher values of a. 
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Univariate Change Point Example 



Location 

(a) 


Multivariate (k=5) Change Point Example 



(b) 


Figure 1: Two simulated examples: (a) univariate data y* ~ N{6i, 1), where 0* = 0 for i < 50 and 
di = 1 for z > 50; (b) multivariate data yi where = (0, 0,0, 0,0)"'' for z < 50 and 

= for i > 50. The blockwise means within each dimension were assumed equal for 

this particular example, but in general, the blockwise means are unconstrained. The top panels 
show the simulated data and the posterior means at each location in the series, and the bottom 
panels show the posterior probability of a change point. 
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Posterior Probability Posterior Means 


Quebec River Streamflow Change Point Anaiysis 



Figure 2: Quebec streamflow example. The year 1984 has a high posterior probability of being a 
change point. 
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Figure 3: Scene partition boundaries and block means for simulations on 20-by-20 grid graphs. An 
error variance of ci^ = 1 is used throughout. The true number of blocks b is given in parentheses. 
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Figure 4: Mean squared error (MSE) for all simulated grid scenes over a large range of values 
for the parameter a. Generally, small values of a produce lower MSE. However, the contrived 
large-boundary scenes marked with asterisks attain lower MSEs at higher values of a. 
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Figure 5: New Haven housing example: minimum spanning tree joining all houses (in circles, with 
size indicating the magnitude of the true assessed value, logval) with dashed lines demarcating the 
region of interest, colored by neighborhood membership in the posterior modal partition. Neigh¬ 
borhood 3 houses lie on busy Prospect Street; upscale neighborhood 1 houses lie on quiet Edgehill 
Road or Huntington Street; neighborhood 2 is a dense neighborhood of generally smaller houses on 
smaller parcels of land. 
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Figure 6: Residuals by longitude for the New Haven housing example: (a) residuals from a linear 
model {SE = 0.200) without any spatial or change point structure; (b) residuals from a linear 
model that estimates a different intercept for each neighborhood as identified by the posterior 
modal partition presented in Figure E](S'S = 0.100); (c) residuals from our method {SE = 0.092). 
Residuals are colored for the three neighborhoods in the modal partition. 
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Figure 7: Mean runtime (seconds) for all simulated grid scenes. Scenes marked with asterisks 
generally attain lower MSEs at higher values of a. 
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Figure 8: Mean number of blocks per iteration for all simulated grid scenes. Scenes marked with 
asterisks generally attain lower MSEs at higher values of a. 
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